(54) Array processor communication architecture with broadcast instructions



(57) A plurality of processor elements (PEs) are connected in a cluster by a common instruction bus to
an instruction memory. Each PE has data buses connected to at least its four nearest PE neighbors, referred
to as its North, South, East and West PE neighbors. Each PE also has a general purpose register file
containing several operand registers. A common instruction is broadcast from the instruction memory over
the instruction bus to each PE in the cluster. The instruction includes an opcode value that controls the
arithmetic or logical operation performed by an execution unit in the PE on an operand from one of the
operand registers in the register file. A switch is included in each PE to interconnect it with a first PE
neighbor as the destination to which the result from the execution unit is sent. The broadcast instruction
includes a destination field that controls the switch in the PE, to dynamically select the destination
neighbor PE to which the result is sent. Further, the broadcast instruction includes a target field that
controls the switch in the PE, to dynamically select the operand register in the register file of the PE,
to which another result received from another neighbor PE in the cluster is stored. In this manner, the
instruction broadcast to all the PEs in the cluster dynamically controls the communication of operands and
results between the PEs in the cluster, in a single instruction, multiple data processor array.



[FIG. 6 (panels FIG. 6A and FIG. 6B): nearest-neighbor communication examples in a single-PE node,
showing the PE-NET switch settings for the transmit/receive cases, e.g. Transmit West, Receive East.]

Printed by Jouve, 75001 PARIS (FR)

EP 0 726 532 A2

Description 

FIELD OF THE INVENTION 

The invention disclosed broadly relates to data processing systems and methods, and more particularly
relates to improvements in array processor architectures.

OBJECTS OF THE INVENTION 

It is therefore an object of the invention to provide an improved instruction driven programmable
parallel processing system.

It is another object of the invention to provide an improved parallel processing system that reduces the
inherent latency in communicating between processors.

It is still a further object of the invention to provide an improved parallel processing system which has
improved performance characteristics.

SUMMARY OF THE INVENTION 

These and other objects, features and advantages are accomplished by the invention. A plurality of
processor elements (PEs) are connected in a cluster by a common instruction bus to an instruction memory.
Each PE has data buses connected to at least its four nearest PE neighbors, referred to as its North, South,
East and West PE neighbors. Each PE also has a general purpose register file containing several operand
registers. A common instruction is broadcast from the instruction memory over the instruction bus to each
PE in the cluster. The instruction includes an opcode value that controls the arithmetic or logical
operation performed by an execution unit in the PE on an operand from one of the operand registers in the
register file. A switch is included in each PE to interconnect it with a first PE neighbor as the
destination to which the result from the execution unit is sent. In accordance with the invention, the
broadcast instruction includes a destination field that controls the switch in the PE, to dynamically
select the destination neighbor PE to which the result is sent. Further in accordance with the invention,
the broadcast instruction includes a target field that controls the switch in the PE, to dynamically select
the operand register in the register file of the PE, to which another result received from another
neighbor PE in the cluster is stored. In this manner, the instruction broadcast to all the PEs in the
cluster dynamically controls the communication of operands and results between the PEs in the cluster, in
a single instruction, multiple data processor array.

DESCRIPTION OF THE FIGURES 

Fig. 1 is a high level array machine organization diagram for multiple control units.
Fig. 2 is an example instruction format for communications, in accordance with the invention.
Fig. 3 is a second example instruction format for communications, in accordance with the invention.
Fig. 4 is a single processor element (diagonal) node flow with connection interfaces.
Fig. 5 is a dual processor element node flow with connection interfaces.
Fig. 6 is a nearest-neighbor communication example in a single processor element node.
Fig. 7 is a nearest-neighbor communication example in a dual processor element node.
Fig. 8 is a logical and folded mesh representation of adjacent processor element column communications.
Fig. 9 is a logical and folded mesh representation of adjacent processor element row communications.
Fig. 10 is a flow organization of a processing element, depicting the flow of an individual processing
element, showing all its arithmetic facilities and the points to find for connection to switch logic and
paired processor elements.
Fig. 11 is a general form diagram of the data select unit.
Fig. 12 is an example of a data select unit in use, wherein byte B of the source is placed in the low
order 8 bits of the destination register and all the remaining destination bits are forced to be the same
as the sign of the byte, as performed by the data selector logic.
Fig. 13 is a flow chart showing folded array fast odd/even symmetric 1-D DCT.
Fig. 14 illustrates the execution of a butterfly surrogate.
Fig. 15 illustrates the first execution of multiply add and DSU send surrogate.
Fig. 16 illustrates the second execution of multiply add and DSU send surrogate.
Fig. 17 illustrates the third execution of multiply add and DSU send surrogate.
Fig. 18 illustrates the fourth execution of multiply add and DSU send surrogate.
Fig. 19 illustrates the execution of the butterfly with a clustered processor element destination surrogate.
I-Bus Switch
    0 = IA-Bus Port connects to the Top PE, IB-Bus Port connects to the Bottom PE
    1 = IA-Bus Port connects to the Bottom PE, IB-Bus Port connects to the Top PE

D-Bus Switch
    0 = DA-Bus Port connects to the Top PE, DB-Bus Port connects to the Bottom PE
    1 = DA-Bus Port connects to the Bottom PE, DB-Bus Port connects to the Top PE

Nearest Neighbor/Adjacent-PE Communications Mode
    0 = Nearest Neighbor Communications Enabled
    1 = Adjacent-PE Communications Enabled

Ring/Array Mode
    00 = Row Rings: N and S Ports Disabled, E and W Ports Enabled
    01 = Column Rings: E and W Ports Disabled, N and S Ports Enabled
    10 = Reserved
    11 = Array Mode

A Load Offset Register instruction provides byte loading of the PE's offset register used in Case
Surrogate instructions.

The PE flow diagrams depicted in Fig. 4 and Fig. 5 show switches which are controlled from instructions
being executed. Three types of switches are indicated in the nodes: the PE-NET Switch, the Data-Bus (D-Bus)
Switch, and the Instruction-Bus (I-Bus) Switch (only in the dual-PE node). The PE-NET switches are
controlled by instructions executing in the PEs; the I-Bus and D-Bus switches are controlled by the PE-mode
control register. The PEs exchange data in different ways among nodes by controlling the PE-NET switches.
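As a rough illustration of how the PE-mode control register fields listed above might be decoded in software, the following Python sketch unpacks the four fields from an integer. The bit positions, the `PEMode` class and the `decode_pe_mode` function are assumptions made for this example only; the text names the fields and their encodings but does not give a register layout.

```python
from dataclasses import dataclass

# Hypothetical bit positions: the text defines the field values, not where
# each field sits inside the PE-mode control register.
I_BUS_BIT = 0        # 0 = IA-Bus to Top PE, 1 = IA-Bus to Bottom PE
D_BUS_BIT = 1        # 0 = DA-Bus to Top PE, 1 = DA-Bus to Bottom PE
COMM_MODE_BIT = 2    # 0 = nearest neighbor, 1 = adjacent-PE
RING_MODE_SHIFT = 3  # two-bit Ring/Array Mode field

RING_MODES = {
    0b00: "row rings (E and W ports enabled)",
    0b01: "column rings (N and S ports enabled)",
    0b10: "reserved",
    0b11: "array mode",
}


@dataclass
class PEMode:
    i_bus_swapped: bool   # True when the IA-Bus port feeds the Bottom PE
    d_bus_swapped: bool   # True when the DA-Bus port feeds the Bottom PE
    adjacent_pe: bool     # True when adjacent-PE communications are enabled
    ring_mode: str


def decode_pe_mode(value: int) -> PEMode:
    """Split a mode-register value into its four fields."""
    return PEMode(
        i_bus_swapped=bool((value >> I_BUS_BIT) & 1),
        d_bus_swapped=bool((value >> D_BUS_BIT) & 1),
        adjacent_pe=bool((value >> COMM_MODE_BIT) & 1),
        ring_mode=RING_MODES[(value >> RING_MODE_SHIFT) & 0b11],
    )
```

Under this assumed layout, `decode_pe_mode(0b11000)` describes a node in array mode with nearest neighbor communications enabled and both bus switches in their default position.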

The Nearest-Neighbor (PE-NET) transmitting/receiving facilities for dual-PE nodes include four
bi-directional or four input and four output driver/receiver ports. For the separate input/output port
case, there is an input and output pair assigned to each nearest neighbor direction. In general, when one
of these ports is assigned to transmit data to a nearest-neighbor PE, another port will also be directed to
simultaneously receive data from a different PE. In the dual-PE node, controls are provided to assure that
only two of the driver/receiver ports are simultaneously transmitting data and two others are
simultaneously receiving data. There are four specific cases to consider: Transmit East Receive West,
Transmit North Receive South, Transmit South Receive North, and Transmit West Receive East. The diagonal
PEs (Fig. 4) share the West/North ports and the South/East ports and thus require only two nearest-neighbor
type I/O ports per diagonal PE. Note that in the dual-PE nodes, the receiving/transmitting mechanisms
consist of four I/O ports.

Fig. 6 and Fig. 7 depict the single-PE and dual-PE nodes in a less detailed form than Fig. 4 and Fig. 5 and
show examples of how the nearest-neighbor ports are used in supporting four possible transmission modes.
The communication modes supported cause data to be transmitted in the four cardinal directions and
simultaneously received from the opposite direction. For example, transmit North and receive from the South.
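The transmit/receive pairing can be pictured on the logical (unfolded) mesh with a short Python sketch. It is only an illustration: the wraparound at the array edges, the row-0-is-north convention and the name `broadcast_transmit` are assumptions of this example, and the folded-array port sharing described above is not modeled.

```python
# (row, column) step taken by the data for each transmit direction.
# Row 0 is treated as the northern edge of the logical mesh (an assumption).
STEP = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def broadcast_transmit(results, direction):
    """Every PE sends its result in `direction` at the same time.

    Returns a grid holding the value each PE receives, which by construction
    arrives from the neighbor on the opposite side. Edges wrap around.
    """
    n_rows, n_cols = len(results), len(results[0])
    d_row, d_col = STEP[direction]
    received = [[None] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            dest_r = (r + d_row) % n_rows
            dest_c = (c + d_col) % n_cols
            received[dest_r][dest_c] = results[r][c]
    return received


# A 4x4 array in which PE (r, c) holds the result 4*r + c.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
from_south = broadcast_transmit(grid, "N")  # transmit North, receive from the South
```

Because one direction applies to all PEs, a single call stands in for one broadcast instruction: no PE needs to know anything beyond the direction carried in that instruction.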
Fig. 8 and Fig. 9 depict, for a four neighborhood array, the logical representation (i.e. on the unfolded
mesh) and the folded mesh representation of the four communications modes possible:

Col-Ring 0 - Col-Ring 1 & Col-Ring 2 - Col-Ring 3 (Figs. 8A and 8B)
Col-Ring 0 - Col-Ring 3 & Col-Ring 1 - Col-Ring 2 (Figs. 8C and 8D)
Row-Ring 0 - Row-Ring 3 & Row-Ring 1 - Row-Ring 2 (Figs. 9C and 9D)
Row-Ring 0 - Row-Ring 1 & Row-Ring 2 - Row-Ring 3 (Figs. 9A and 9B)

The folded-array nodes each contain one or two Processing Elements as shown in Fig. 4 and Fig. 5. The PEs,
which are all identical, each contain two general types of arithmetic units (an ALU and a fixed
point/floating point Multiply/Add Unit (MAU)), a Data Selection Unit (DSU), a Local Load Data Address
Generator, a Local Store Data Address Generator, and a set of 32 GPRs which serves to hold operands and
working results for operations performed in the node. The register set is referred to as the General
Purpose Register File, or GPRF for short. A view of the PE data flow organization, showing its individual
processing elements and GPR file, is shown in Fig. 10.

Three classes of Multiply/Add Unit (MAU) instructions are architected: one for 16x16 single 32-bit fixed
point results, a second for 32x32/dual-16x16 double 32-bit fixed point results, and a third for Single
Precision Real floating point multiply-add results. An array processor can be designed to support any one
of these options and operate as subsets of the full architecture. With a 64-bit result (or double 32-bit
results) the low 32-bit half uses bus Q. For an implementation with a 16x16 MAU only, the Q bus is not used.
For a 32-bit Mfast processor the 32x32 MAU instructions are only able to write their results back to a
local GPRF. The DEST destination field is not used in 32x32 MAU instructions and the second 32-bit result
is written back to the target register specified plus one. Use of the clustered communications and nearest
neighbor interface for the 32x32 MAU instructions is reserved for future machines with 64-bit Nearest
Neighbor ports. Note that the Processing Element flow does not include any of the switch logic required for
communication among array nodes; that logic is unique to the nodes themselves. The idea, here, is that the
logic for the PE can be designed as a macro and used repeatedly to build the array. It is intended for the
PE to be designed with 6 unconnected GPRF input ports, and allow these inputs to be connected in the manner
appropriate for the particular node in which the PE is included. Fig. 10 indicates typical direct
connections (shown in dashed lines) from the ALU, MAU, and DSU. A more complete picture of these direct
connections is shown in the Single-PE node flow diagram (Fig. 4).

The GPRF input ports may also be multiplexed as shown in the Dual-PE node flow diagram (Fig. 5).

The Data Select Unit (DSU) is used in register-to-register moves and data shift operations. In the specific
situation when the move destination is also the source for another move (registers are co-destinations
between paired-PEs), a SWAP function can be achieved. The general form of the DSU is shown in Fig. 11.

The logic in the Data Selector is used to modify the data passing from source to destination for those
instructions which require it. For example, when a byte is selected from the source register and then
loaded into the destination in sign-extended form, the Data Selector will perform the byte-alignment and
sign-extension functions. A simple example of this kind of operation is shown in Fig. 12.

Three types of data select/move operations are provided by the Data Selector: word move, halfword move,
and byte move. Within these types of moves, certain variations are supported:

Word move

Halfword move
    Any source halfword to any destination halfword
    Any source halfword to low half of word and:
        High halfword forced to all zeros
        High halfword forced to all ones
        High halfword forced to low halfword sign value

Byte move
    Any source byte to any destination byte
    Any source byte to low destination byte and:
        High bytes forced to all zeros
        High bytes forced to all ones
        High bytes forced to low byte sign value
    High or low source byte pairs (b0 and b1, b2 and b3) to destination bytes b1 and b3, and:
        High bytes forced to all zeros
        High bytes forced to all ones
        High bytes forced to low byte sign value
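The "any source byte to low destination byte" variations listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Data Selector logic itself: the function name `dsu_byte_to_low`, the `fill` parameter, and the convention that byte 0 is the least significant byte of a 32-bit word are all choices made for the example.

```python
MASK32 = 0xFFFFFFFF
HIGH_BYTES = 0xFFFFFF00  # bits 8..31 of a 32-bit word


def dsu_byte_to_low(src, byte_index, fill="sign"):
    """Move one byte of a 32-bit source into the low-order 8 bits.

    fill selects what the three high bytes are forced to:
    "zeros", "ones", or "sign" (copies of the selected byte's sign bit).
    Byte 0 is assumed to be the least significant byte.
    """
    if byte_index not in (0, 1, 2, 3):
        raise ValueError("byte_index must be 0..3")
    byte = (src >> (8 * byte_index)) & 0xFF
    if fill == "zeros":
        high = 0
    elif fill == "ones":
        high = HIGH_BYTES
    elif fill == "sign":
        high = HIGH_BYTES if byte & 0x80 else 0
    else:
        raise ValueError("fill must be 'zeros', 'ones' or 'sign'")
    return (high | byte) & MASK32
```

With `src = 0x12F45678`, selecting byte 2 (0xF4) with sign fill yields 0xFFFFFFF4, which corresponds to the byte-alignment plus sign-extension case described for Fig. 12.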



When a PE execution unit performs an operation, the resulting output (P, U, and/or S) is sent to a
destination register which may be in the same PE as the execution unit (a local register), in a paired-PE
(in a dual-PE node), in a nearest-neighbor (NN) PE, or in an adjacent-PE. In all cases, the destination
register (Rm, Rt, and/or Rs) is specified in conjunction with the instruction's Destination (DEST) field.
Table 1 and Table 2 list the destination (DEST) options presently defined. This field may be modified for
specific instructions. For surrogate instructions only one execution unit is specified to use the nearest
neighbor interface.

A folded array 2 dimensional (2D) discrete cosine transform is presented next.

The signal flow graph for the symmetric discrete cosine transform (DCT) is shown in Fig. 13. Note that the
outputs are scaled by 2C(u)/N where C(u)=1/sqrt 2 for u=0 and C(u)=1 otherwise. Note that
c#x=cos(# pi/16) and that 1/(4 sqrt 2) = 1/4 c4x.

For the 2-D DCT, a 1-D DCT on the columns followed by a 1-D DCT on the rows produces the 2-D DCT result.
A multiplication accumulate and register transfer via the nearest neighbor ports procedure is used. Since
the butterfly results are 16-bits and the nearest neighbor ports are 32-bits, both odd and even butterfly
values can be sent between PEs each cycle. With dual 16x16 multipliers in each PE, both odd and even parts
of the 1-D DCT for 4 columns can be calculated in the same four cycles. The column 1-D DCT equations are
shown in Table 3. The lower case 'z' in the following lists represents the column # being processed.
Fig. 14 illustrates the execution of the butterfly surrogate. Fig. 15 illustrates the first execution of
the multiply add and DSU send surrogate. Fig. 16 illustrates the second execution of the multiply add and
DSU send surrogate. Fig. 17 illustrates the third execution of the multiply add and DSU send surrogate.
Fig. 18 illustrates the fourth execution of the multiply add and DSU send surrogate. Fig. 19 illustrates
the execution of the butterfly with the clustered processor element destination surrogate.

Another 4 cycles finishes all 8 column 1-D DCTs. The scaling of the output by 1/4 is not done at this
point; rather, the combined scaling of 1/16 for the 2-D DCT is included in the calculation of the
quantization table values. Consequently, the procedure continues with doing the 1-D DCT on the row values.
First a half-word butterfly surrogate instruction creates all the row butterfly values, placing the values
in the pattern shown in Fig. 19 where "z", of Az through Hz, represents the row # now instead of the
column #. Note that the half-word butterfly instructions send their results to the paired-PE register
instead of a local register. This is a communication operation between the dual PEs used to ensure the
data lines up with the coefficients.





Az = f0z + f7z
Bz = f1z + f6z
Cz = f2z + f5z
Dz = f3z + f4z
Ez = f3z - f4z
Fz = f2z - f5z
Gz = f1z - f6z
Hz = f0z - f7z



Next a sequence of eight multiply-add-send operations are completed followed by a scaling (shift) operation
to conclude the 2-D DCT. Note that the rows are done in a different order than the columns were done, with
the even rows in the first set of four 32 multiplication operations followed by the odd rows in the second
set of four 32 multiplications. In the first 1-D DCT, columns 0-3 were done first followed by columns 4-7.
The end result is the same, accomplishing the 2-D DCT on the whole 8x8 array. In the JPEG and MPEG
algorithms a quantization step follows the 2-D DCT, in which case the scaling step can be included in the
quantization step. The total number of cycles for an 8x8 2-D DCT (excluding scaling and quantization) is
18 cycles.
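The arithmetic behind the butterfly-then-multiply-accumulate procedure can be checked with a small Python sketch. It is a plain sequential model, not the folded-array schedule: the function names are invented for this example, floating point replaces the 16-bit fixed-point values, and the 2C(u)/N scaling is applied inline here, whereas the text defers it to the quantization step.

```python
import math


def _scale(u):
    """2*C(u)/N with N = 8, C(0) = 1/sqrt(2), C(u) = 1 otherwise."""
    return 2 * (1 / math.sqrt(2) if u == 0 else 1.0) / 8


def dct8_direct(f):
    """Reference 8-point DCT computed straight from the definition."""
    return [
        _scale(u) * sum(f[x] * math.cos((2 * x + 1) * u * math.pi / 16) for x in range(8))
        for u in range(8)
    ]


def dct8_butterfly(f):
    """8-point DCT using the odd/even butterfly split.

    Even outputs use the sums (Az..Dz), odd outputs use the differences
    (Hz..Ez), so each output needs only four multiply-accumulate steps.
    """
    sums = [f[x] + f[7 - x] for x in range(4)]    # Az, Bz, Cz, Dz
    diffs = [f[x] - f[7 - x] for x in range(4)]   # Hz, Gz, Fz, Ez
    out = []
    for u in range(8):
        half = sums if u % 2 == 0 else diffs
        acc = 0.0
        for x in range(4):
            acc += half[x] * math.cos((2 * x + 1) * u * math.pi / 16)
        out.append(_scale(u) * acc)
    return out


def dct8x8(block):
    """2-D DCT: 1-D DCT on every column, then 1-D DCT on every row."""
    cols = [dct8_butterfly([block[r][c] for r in range(8)]) for c in range(8)]
    return [dct8_butterfly([cols[c][u] for c in range(8)]) for u in range(8)]
```

Comparing `dct8_butterfly` against `dct8_direct` on the same input is a quick way to confirm that the eight butterfly values carry all the information needed for the eight outputs, which is what lets each PE work with four accumulations per result.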
The problems and how they are solved are summarized as follows:

1. Provide a programmable low latency communication mechanism between processing elements in an array of
processing elements.

The specification of the destination of results from functional execution units is changed from always the
local processor's storage (register file), to any direct attached processor's storage (register file).

2. Pipeline across the array of processing elements.

Due to the communication of results between processing elements with zero latency, computations can be
pipelined across the array of processing elements.

3. Communicate between clustered processing elements.

Specialized Data Select Unit instructions and expansion of the direct attached destination specification
to include clustered processing elements provides the zero latency communications between the clustered
processing elements and the ability to pipeline across the total interconnected array.

Although a specific embodiment of the invention has been disclosed, it will be understood by those having
skill in the art that changes can be made to that specific embodiment without departing from the spirit
and the scope of the invention.

Table 1

Nearest Neighbor Result Destination (For X substitute P/U/S)

DEST 0000
    Single-PE Action: local GPRF Register <- X
    Dual-PE Action:   local GPRF Register <- X

DEST 0001
    Single-PE Action: local GPRF Register <- X
    Dual-PE Action:   Transpose-PE GPRF Register <- X
DISCUSSION OF THE PREFERRED EMBODIMENT

Fig. 1 depicts a high level view of the Mwave array processor machine organization. The machine
organization is partitioned into three main parts: the System Interfaces including Global Memory 100 and
external I/O, multiple Control Units 103 with Local Memory 102, and the Execution Array with Distributed
Control PEs 104. The System Interface is an application-dependent interface through which the Mwave array
processor interfaces with Global Memory, the I/O, other system processors, and the personal
computer/workstation host. Consequently, the System Interface will vary depending upon the application and
the overall system design. The Control Units 103 contain the local memory 102 for instruction and data
storage, instruction fetch (I-Fetch) mechanisms, and operand or data fetch mechanisms (D-Fetch). The
Execution Array with Distributed Control PEs 104 is a computational topology of processing elements chosen
for a particular application. For example, the array may consist of N Processing Elements (PEs) 104 per
control unit 103, with each PE containing an Instruction Buffer (IMRY) 106, a General Purpose Register
File (GPRF) 108, Functional Execution units (FNS) 110, Communication Facilities (COM) 112, and interfaces
to its instruction bus 114 and its data buses. The PEs may also contain PE-local instruction and data
memories. Further, each PE contains an instruction decode register 116 shown in Figure 4, which supports
distributed control of the multiple PEs. Synchronism of local memory accessing is a cooperative process
between the control units 103, local memories 102, and the PEs 104. The array of PEs allows computation
functions (FNS) to be executed in parallel in the PEs and results to be communicated (COM) between PEs.

With the multiple single instruction, multiple data (MSIMD) machine organization, e.g. Fig. 1, it is
possible to create single or multiple thread machines wherein the topology of PEs and communication
facilities can be configured for a more optimum topology depending upon the application. For example, some
possible machine organizations are: multiple linear rings, a nearest neighbor 2-dimension mesh array, a
folded nearest neighbor 2-dimension mesh, multiple-fold mesh, a 2-dimension hexagonal array, a folded
2-dimension hexagonal array, a folded mesh of trees, combinations of the above as well as others.
The basic concept involved in the Mwave array processor family is that the communications of results
between direct-attached processing elements can be specified in the processing element instruction set
architecture 115. In a typical RISC uni-processor, the destination of results from functional execution
units is implied to be the processor's own register file. The Mwave array processor breaks this
'tradition', changing the definition of result destination from an implied single processor local
destination to a directly-connected multiple processor destination. With this architecture and
directly-connected PE-to-PE links, we can claim that communications between the directly connected PEs can
be done with zero communication latency. The architecture provides this capability by including a
destination field 120 in the simplex instructions 115 indicating the directly-connected succeeding PE
where the result target register resides, which is specified by the target field 126. Figure 2 shows a
basic simplex 32-bit instruction 115 used in the processor, though 16-bit, 64-bit, etc. formats can use the
same principle. For a four PE neighborhood, a North, East, West, and South PE destination is coded in the
Destination Extension field 120, which provides a 4-bit field here to allow for growth up to eight
clustered processors (used in a three fold array) and up to an eight neighborhood array. Examples of
topologies with direct-attached PEs are a nearest neighbor mesh, a folded mesh, a tree array, hypercubes,
etc. It should also be appreciated that due to the clustering and nearest neighbor mesh organizations, the
Physical Design process will physically place the directly connected PEs in close proximity to each other,
supporting short cycle times. Further, it should be appreciated that this binding of the communications
destination specification in the instruction set format is done purposely, not only to provide
0-communication latency, but also to ensure hazard free communications in a SIMD array of processing
elements. For explanation purposes only, the processor is used to describe the architectural details
involved in implementing the described communications concept.

In the array processor instruction 115, the Operand-1 field 122, Operand-2 field 124, and Target field 126
are register specifications and the Destination field 120 specifies the direct attached succeeding
processing element to which the result is to be communicated. The opcode field 128 specifies the
arithmetic or logical operation to be performed by a selected execution unit.
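One way to picture the five fields of instruction 115 is as a packed 32-bit word. The Python sketch below is illustrative only: the text gives the 32-bit word size and the 4-bit destination field, and Table 1 gives the DEST codes for the four cardinal directions, but the 5-bit register fields (chosen to address 32 GPRs), the 13-bit opcode, the field order, and the names `FIELDS`, `DEST`, `encode` and `decode` are assumptions of this example.

```python
# (name, width in bits), most significant field first. Only the 4-bit DEST
# width and the 32-bit total come from the text; the rest is assumed.
FIELDS = (
    ("opcode", 13),
    ("dest", 4),
    ("operand1", 5),
    ("operand2", 5),
    ("target", 5),
)

# DEST codes for a local result and the four cardinal directions (Table 1).
DEST = {"LOCAL": 0b0000, "NORTH": 0b1000, "EAST": 0b1010, "SOUTH": 0b1100, "WEST": 0b1110}


def encode(**values):
    """Pack the named field values into one 32-bit instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack a 32-bit instruction word into a dict of field values."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


# The same word would be broadcast to every PE: operate on R3 and R4,
# send the result North, and store the value arriving from the South in R7.
word = encode(opcode=0x1A, dest=DEST["NORTH"], operand1=3, operand2=4, target=7)
```

Keeping the destination inside the instruction word, rather than in separate mode state, is what allows the direction to change from one instruction to the next without an extra control step.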

In this type of SIMD array processor it is possible to tag the destination field as is shown in Figure 3,
and use mode control instructions in the sequencer controller units to set the tag. The tags are then
distributed to the PEs in the array. In order to change communications directions, a mode control
instruction is issued prior to the execution of the instruction that communicates the result value. There
are a number of implications in this approach. First, by tagging, the instruction has full use of the
instruction field for function definition. Second, tagging incurs an additional latency whenever the
destination needs to be changed during communicating. If the loading of the tag register can be
incorporated in a surrogate very long instruction word (VLIW), then it becomes possible to change the
destination on a cycle by cycle basis. This minimizes the effect of this latency at the expense of complex
control algorithms.

The single, diagonally folded array Processor Element (PE) data flow will be briefly described. The
folded-array nodes are of two types: diagonal nodes containing a single Processing Element, and all other
nodes which contain two PEs each. The details of each type are discussed in the following sections.
Table 1 (continued)

Nearest Neighbor Result Destination (For X substitute P/U/S)

DEST 0010
    Single-PE Action: Hyper-cube Complement-PE GPRF Register <- X
    Dual-PE Action:   Hyper-cube Complement-PE GPRF Register <- X

DEST 0011
    Single-PE Action: Reserved for Clustered Destinations
    Dual-PE Action:   Reserved for Clustered Destinations

DEST ...
    Single-PE Action: Reserved for Clustered Destinations
    Dual-PE Action:   Reserved for Clustered Destinations

DEST 0111
    Single-PE Action: Reserved for Clustered Destinations
    Dual-PE Action:   Reserved for Clustered Destinations

DEST 1000 North
    Single-PE Action:
        • W/N Out Port <- Local-X
        • TRGT Register <- S/E In Port South-X
    Dual-PE Action:
        • N/W Out Port <- Local-Xt
        • Top-PE TRGT Reg <- S/E In Port South-Xt
        • W/N Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- E/S In Port South-Xb

DEST 1001 North East
    Single-PE Action: Reserved (e.g. 8-NN: NE port)
    Dual-PE Action:   Reserved (e.g. 8-NN: NE port)

DEST 1010 East
    Single-PE Action:
        • S/E Out Port <- Local-X
        • TRGT Register <- W/N In Port West-X
    Dual-PE Action:
        • E/S Out Port <- Local-Xt
        • Top-PE TRGT Reg <- W/N In Port West-Xt
        • S/E Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- N/W In Port West-Xb

DEST 1011 South East
    Single-PE Action: Reserved (e.g. 8-NN: SE port)
    Dual-PE Action:   Reserved (e.g. 8-NN: SE port)

DEST 1100 South
    Single-PE Action:
        • S/E Out Port <- Local-X
        • TRGT Register <- W/N In Port North-X
    Dual-PE Action:
        • S/E Out Port <- Local-Xt
        • Top-PE TRGT Reg <- N/W In Port North-Xt
        • E/S Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- W/N In Port North-Xb

DEST 1101 South West
    Single-PE Action: Reserved (e.g. 8-NN: SW port)
    Dual-PE Action:   Reserved (e.g. 8-NN: SW port)

DEST 1110 West
    Single-PE Action:
        • W/N Out Port <- Local-X
        • TRGT Register <- S/E In Port East-X
    Dual-PE Action:
        • W/N Out Port <- Local-Xt
        • Top-PE TRGT Reg <- E/S In Port East-Xt
        • N/W Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- S/E In Port East-Xb

DEST 1111 North West
    Single-PE Action: Reserved (e.g. 8-NN: NW port)
    Dual-PE Action:   Reserved (e.g. 8-NN: NW port)


Note: Single-PE nodes have two nearest-neighbor ports and Dual-PE nodes have four such ports. Use of these
is depicted in Figure 6 on page 10 and Figure 7 on page 11 respectively. The notation Xt and Xb refers to
the 'top' and 'bottom' PEs in a Dual-PE node shown in Figure 7 on page 11.








Table 2 (Page 1 of 3). Adjacent-PE Result Destination (For X substitute P/U/S)

DEST 0000
    Single-PE Action: local GPRF Register <- X
    Dual-PE Action:   local GPRF Register <- X

DEST 0001
    Single-PE Action: local GPRF Register <- X
    Dual-PE Action:   Transpose-PE GPRF Register <- X

DEST 0010
    Single-PE Action: Hyper-cube Complement-PE GPRF Register <- X
    Dual-PE Action:   Hyper-cube Complement-PE GPRF Register <- X

DEST 0011
    Single-PE Action: Reserved for Clustered Destinations
    Dual-PE Action:   Reserved for Clustered Destinations

DEST ...
    Single-PE Action: Reserved for Clustered Destinations
    Dual-PE Action:   Reserved for Clustered Destinations

DEST 0111
    Single-PE Action: Reserved for Clustered Destinations
    Dual-PE Action:   Reserved for Clustered Destinations

DEST 1000 North
    Single-PE Action:
        Even Row PEs
        • W/N Out Port <- Local-X
        • TRGT Register <- W/N In Port North-X
        Odd Row PEs
        • S/E Out Port <- Local-X
        • TRGT Register <- S/E In Port South-X
    Dual-PE Action:
        Even Row PEs
        • N/W Out Port <- Local-Xt
        • Top-PE TRGT Reg <- N/W In Port North-Xt
        • W/N Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- W/N In Port North-Xb
        Odd Row PEs
        • S/E Out Port <- Local-Xt
        • Top-PE TRGT Reg <- S/E In Port South-Xt
        • E/S Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- E/S In Port South-Xb

DEST 1001 North East
    Single-PE Action: Reserved (e.g. 8-NN: NE port)
    Dual-PE Action:   Reserved (e.g. 8-NN: NE port)
Table 2 (Page 2 of 3). Adjacent-PE Result Destination (For X substitute P/U/S)

DEST 1010 East
    Single-PE Action:
        Even Column PEs
        • S/E Out Port <- Local-X
        • TRGT Register <- S/E In Port East-X
        Odd Column PEs
        • W/N Out Port <- Local-X
        • TRGT Register <- W/N In Port West-X
    Dual-PE Action:
        Even Column PEs
        • E/S Out Port <- Local-Xt
        • Top-PE TRGT Reg <- E/S In Port East-Xt
        • S/E Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- S/E In Port East-Xb
        Odd Column PEs
        • W/N Out Port <- Local-Xt
        • Top-PE TRGT Reg <- W/N In Port West-Xt
        • N/W Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- N/W In Port West-Xb

DEST 1011 South East
    Single-PE Action: Reserved (e.g. 8-NN: SE port)
    Dual-PE Action:   Reserved (e.g. 8-NN: SE port)

DEST 1100 South
    Single-PE Action:
        Even Row PEs
        • S/E Out Port <- Local-X
        • TRGT Register <- S/E In Port South-X
        Odd Row PEs
        • W/N Out Port <- Local-X
        • TRGT Register <- W/N In Port North-X
    Dual-PE Action:
        Even Row PEs
        • S/E Out Port <- Local-Xt
        • Top-PE TRGT Reg <- S/E In Port South-Xt
        • E/S Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- E/S In Port South-Xb
        Odd Row PEs
        • N/W Out Port <- Local-Xt
        • Top-PE TRGT Reg <- N/W In Port North-Xt
        • W/N Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- W/N In Port North-Xb

DEST 1101 South West
    Single-PE Action: Reserved (e.g. 8-NN: SW port)
    Dual-PE Action:   Reserved (e.g. 8-NN: SW port)
Table 2 (Page 3 of 3). Adjacent-PE Result Destination (For X substitute P/U/S)

DEST 1110 West
    Single-PE Action:
        Even Column PEs
        • W/N Out Port <- Local-X
        • TRGT Register <- W/N In Port West-X
        Odd Column PEs
        • S/E Out Port <- Local-X
        • TRGT Register <- S/E In Port East-X
    Dual-PE Action:
        Even Column PEs
        • W/N Out Port <- Local-Xt
        • Top-PE TRGT Reg <- W/N In Port West-Xt
        • N/W Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- N/W In Port West-Xb
        Odd Column PEs
        • E/S Out Port <- Local-Xt
        • Top-PE TRGT Reg <- E/S In Port East-Xt
        • S/E Out Port <- Local-Xb
        • Bot-PE TRGT Reg <- S/E In Port East-Xb

DEST 1111 North West
    Single-PE Action: Reserved (e.g. 8-NN: NW port)
    Dual-PE Action:   Reserved (e.g. 8-NN: NW port)

Note: Single-PE nodes have two nearest-neighbor ports and Dual-PE nodes have four such ports. Use of these
is depicted in Figure 6 on page 10 and Figure 7 on page 11 respectively. The notation Xt and Xb refers to
the 'top' and 'bottom' PEs in a Dual-PE node shown in Figure 7 on page 11.



TABLES 

40 

, 1 . Butterfly caiculation on each column Figure 14 

• Az = PzO + Pz7 

• Bz = Pzl + Pz6 
45 . . Cz = Pz2 + Pz5 

• Dz = Pz3 + Pz4 

• ' Ez = Pz3 - Pz4 

• Fz = Pz2 - Pz5 

• Gz=Pz1-Pz6 
so • Hz = PzO-Pz7 

2. 1st 32 Multiplies column z=0-3, Send Column 0-3 Pairs of Butterfly Results South (Figure 15):

• fz0-1 = Az(c4x)
• fz2-1 = Bz(c6x)
• fz4-1 = Cz(-c4x)
• fz6-1 = Dz(-c6x)
• fz1-1 = Ez(c7x)
• fz3-1 = Fz(-c1x)
• fz5-1 = Gz(-c1x)
• fz7-1 = Hz(c7x)
• Send (Az,Hz), (Bz,Gz), (Cz,Fz), (Dz,Ez) to South PE

3. 2nd 32 Multiplies column z=0-3, Send Column 0-3 Pairs of Butterfly Results South (Figure 16):

• fz0-2 = (fz0-1 = Az(c4x)) + Dz(c4x)
• fz2-2 = (fz2-1 = Bz(c6x)) + Az(c2x)
• fz4-2 = (fz4-1 = Cz(-c4x)) + Bz(-c4x)
• fz6-2 = (fz6-1 = Dz(-c6x)) + Cz(c2x)
• fz1-2 = (fz1-1 = Ez(c7x)) + Fz(c5x)
• fz3-2 = (fz3-1 = Fz(-c1x)) + Gz(-c7x)
• fz5-2 = (fz5-1 = Gz(-c1x)) + Hz(c5x)
• fz7-2 = (fz7-1 = Hz(c7x)) + Ez(-c1x)
• Send (Az,Hz), (Bz,Gz), (Cz,Fz), (Dz,Ez) to South PE



4. 3rd 32 Multiplies column z=0-3, Send Column 0-3 Pairs of Butterfly Results South (Figure 17):

• fz0-3 = (fz0-2 = Az(c4x) + Dz(c4x)) + Cz(c4x)
• fz2-3 = (fz2-2 = Bz(c6x) + Az(c2x)) + Dz(-c2x)
• fz4-3 = (fz4-2 = Cz(-c4x) + Bz(-c4x)) + Az(c4x)
• fz6-3 = (fz6-2 = Dz(-c6x) + Cz(c2x)) + Bz(-c2x)
• fz1-3 = (fz1-2 = Ez(c7x) + Fz(c5x)) + Gz(c3x)
• fz3-3 = (fz3-2 = Fz(-c1x) + Gz(-c7x)) + Hz(c3x)
• fz5-3 = (fz5-2 = Gz(-c1x) + Hz(c5x)) + Ez(c3x)
• fz7-3 = (fz7-2 = Hz(c7x) + Ez(-c1x)) + Fz(c3x)
• Send (Az,Hz), (Bz,Gz), (Cz,Fz), (Dz,Ez) to South PE

5. 4th 32 Multiplies column z=0-3, Send Column 0-3 Pairs of Butterfly Results South (Figure 18):

• fz0-4 = (fz0-3 = Az(c4x) + Dz(c4x) + Cz(c4x)) + Bz(c4x)
• fz2-4 = (fz2-3 = Bz(c6x) + Az(c2x) + Dz(-c2x)) + Cz(-c6x)
• fz4-4 = (fz4-3 = Cz(-c4x) + Bz(-c4x) + Az(c4x)) + Dz(c4x)
• fz6-4 = (fz6-3 = Dz(-c6x) + Cz(c2x) + Bz(-c2x)) + Az(c6x)
• fz1-4 = (fz1-3 = Ez(c7x) + Fz(c5x) + Gz(c3x)) + Hz(c1x)
• fz3-4 = (fz3-3 = Fz(-c1x) + Gz(-c7x) + Hz(c3x)) + Ez(-c5x)
• fz5-4 = (fz5-3 = Gz(-c1x) + Hz(c5x) + Ez(c3x)) + Fz(c7x)
• fz7-4 = (fz7-3 = Hz(c7x) + Ez(-c1x) + Fz(c3x)) + Gz(-c5x)
• Send (Az,Hz), (Bz,Gz), (Cz,Fz), (Dz,Ez) to South PE
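The four multiply-accumulate passes of steps 2 to 5 can be checked numerically with a short Python sketch. It rests on an assumption that is not stated in this section: that the constants c1x to c7x denote the cosines cos(kπ/16) of an 8-point DCT, with any overall scale factor left out. The names `PASSES`, `dct_by_passes` and `dct_direct` are hypothetical and exist only for this example.

```python
import math

# Assumption: ckx stands for cos(k*pi/16); overall scaling is omitted.
c = {k: math.cos(k * math.pi / 16) for k in range(1, 8)}


def butterfly(p):
    """Step 1: sums and differences of mirrored samples of one column."""
    return {
        "A": p[0] + p[7], "B": p[1] + p[6], "C": p[2] + p[5], "D": p[3] + p[4],
        "E": p[3] - p[4], "F": p[2] - p[5], "G": p[1] - p[6], "H": p[0] - p[7],
    }


# For each output index u: the (operand, coefficient) used in passes 1..4,
# in the order listed in steps 2 to 5.
PASSES = {
    0: [("A", c[4]), ("D", c[4]), ("C", c[4]), ("B", c[4])],
    2: [("B", c[6]), ("A", c[2]), ("D", -c[2]), ("C", -c[6])],
    4: [("C", -c[4]), ("B", -c[4]), ("A", c[4]), ("D", c[4])],
    6: [("D", -c[6]), ("C", c[2]), ("B", -c[2]), ("A", c[6])],
    1: [("E", c[7]), ("F", c[5]), ("G", c[3]), ("H", c[1])],
    3: [("F", -c[1]), ("G", -c[7]), ("H", c[3]), ("E", -c[5])],
    5: [("G", -c[1]), ("H", c[5]), ("E", c[3]), ("F", c[7])],
    7: [("H", c[7]), ("E", -c[1]), ("F", c[3]), ("G", -c[5])],
}


def dct_by_passes(p):
    """Accumulate one product per output in each of four passes."""
    v = butterfly(p)
    f = [0.0] * 8
    for step in range(4):
        for u, terms in PASSES.items():
            name, coeff = terms[step]
            f[u] += v[name] * coeff
    return f


def dct_direct(p):
    """Reference: unscaled 8-point DCT-II, with the u = 0 term weighted by c4."""
    out = []
    for u in range(8):
        weight = c[4] if u == 0 else 1.0
        total = sum(p[n] * math.cos((2 * n + 1) * u * math.pi / 16) for n in range(8))
        out.append(weight * total)
    return out


sample = [10, 20, 30, 40, 50, 60, 70, 80]
max_error = max(abs(a - b) for a, b in zip(dct_by_passes(sample), dct_direct(sample)))
```

Under the stated assumption the two functions agree to floating-point precision, which suggests that the four passes together form one 1-D DCT per column. Each pass performs eight multiplies per column, which would give the 32 multiplies per pass named in the step headings for columns z=0-3.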



Claims 

1. A data processing system, comprising:

a storage means for storing a plurality of instructions, each instruction including a first designation of a source
register, a second designation of an execution unit operation, a third designation of an execution unit to output 
port routing, and a fourth designation of an input port to target register routing; 

a plurality of processing elements, each coupled by means of an instruction bus to said storage means, each 
of said processing elements receiving one of said instructions broadcast over said instruction bus; 

each of said processing elements comprising: 

an instruction register coupled to said instruction bus, for receiving said broadcast instruction; 

a register file coupled to said instruction register, said register file including a target register and a first operand
register which stores a first operand;

an execution unit coupled to said instruction register, said first instruction designation controlling selective
coupling of said first operand register with said execution unit to provide to it said first operand;

said second instruction designation controlling said execution unit to execute an operation on said first operand
to produce a result operand; 

at least a first and a second output ports having outputs respectively coupled to a first and a second succeeding 
ones of said processing elements; 

a switching means coupled to said instruction register, said third instruction designation controlling said switch- 
ing means to selectively couple said execution unit to either the first output port or the second output port, to 
provide said result operand to the first succeeding processing element or to the second succeeding processing 
element, respectively; 

at least a first and a second input ports having inputs respectively coupled to a first and a second preceding
ones of said processing elements, the first input port adapted to receive a first next operand from the first 
preceding processing element and the second input port adapted to receive a second next operand from the 
second preceding processing element; and 

said fourth instruction designation controlling said switching means to selectively couple said target register 
to either said first one of said input ports, to provide said first next operand to said target register or said second
one of said input ports, to provide said second next operand to said target register; 

whereby single instruction, multiple data processing can be performed.
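The four instruction designations recited in claim 1 can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not the claimed hardware: the PEs are arranged in a simple ring (the claim only requires two succeeding and two preceding elements), output port 0 feeds input port 0 of the next PE, output port 1 feeds input port 1 of the previous PE, and the names `Instruction` and `broadcast_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instruction:
    src: int                   # first designation: source register index
    op: Callable[[int], int]   # second designation: execution unit operation
    out_port: int              # third designation: output port (0 or 1)
    in_port: int               # fourth designation: input port (0 or 1) ...
    target: int                # ... and the target register it is routed to


def broadcast_step(regs: List[List[int]], instr: Instruction) -> List[List[int]]:
    """Apply one broadcast instruction to every PE (regs[i] is PE i's register file)."""
    n = len(regs)
    # inputs[i][port] holds the value arriving on that input port of PE i, if any.
    inputs = [[None, None] for _ in range(n)]
    for i in range(n):
        result = instr.op(regs[i][instr.src])
        if instr.out_port == 0:
            inputs[(i + 1) % n][0] = result   # assumed wiring: port 0 -> next PE
        else:
            inputs[(i - 1) % n][1] = result   # assumed wiring: port 1 -> previous PE
    for i in range(n):
        value = inputs[i][instr.in_port]
        if value is not None:                 # nothing arrived on the selected port
            regs[i][instr.target] = value
    return regs


# Four PEs, two registers each; every PE doubles register 0 and sends it onward.
pe_regs = [[1, 0], [2, 0], [3, 0], [4, 0]]
broadcast_step(pe_regs, Instruction(src=0, op=lambda x: 2 * x,
                                    out_port=0, in_port=0, target=1))
```

Because the same instruction reaches every PE, a single broadcast both computes a result in each PE and moves it to a neighbor selected by the instruction itself, which is the behavior the claim attributes to the third and fourth designations.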

2. The data processing system of claim 1, wherein said first succeeding processing element further comprises:

a second instruction register coupled to said instruction bus, for receiving said broadcast instruction;

a second register file coupled to said instruction register, said second register file including a second target 
register; 

at least two input ports, including a receiving one of which is coupled to said first output port, for receiving said 
result operand; and 

a second switching means coupled to said second instruction register, said fourth instruction designation
controlling said second switching means to selectively couple said second target register to said receiving input
port, to provide said result operand to said second target register;

whereby single instruction, multiple data processing can be performed. 

3. The data processing system of claim 1 or 2, which further comprises:

at least two execution units coupled to said instruction register, said first instruction designation controlling
selective coupling of said first operand register with a first one of said at least two execution units to provide
to it said first operand;

said second instruction designation controlling said first one of said execution units to execute an operation 
on said first operand to produce a first result operand; and 

said third instruction designation controlling said switching means to selectively couple said first one of said 
execution units to either the first output port or the second output port, to provide said first result operand to 
the first succeeding processing element or to the second succeeding processing element, respectively. 

4. The data processing system of any one of claims 1 to 3, which further comprises:

an instruction sequencing means coupled to said storage means and to said instruction bus, for fetching said
instructions from said storage means and broadcasting them to said plurality of processing elements.

5. A data processing method, comprising:

retrieving a plurality of instructions, each instruction including a first designation of a source register, a second 
designation of an execution unit operation, a third designation of an execution unit to output port routing, and
a fourth designation of an input port to target register routing; 

broadcasting one of said instructions to each of a plurality of processing elements; 

controlling with said first instruction designation, a selective coupling of a first operand register with an exe- 
cution unit in each of said processing elements, to provide a first operand; 

controlling with said second instruction designation, said execution unit to execute an operation on said first
operand to produce a result operand in each of said processing elements; 

controlling with said third instruction designation, a switching means in each processing element to selectively
couple said execution unit to either a first output port or a second output port, to provide said result operand
to a first succeeding processing element or to a second succeeding processing element, respectively; and

controlling with said fourth instruction designation, said switching means to selectively couple a target register 
to either a first input port, to provide a first next operand to said target register or a second input port, to provide
a second next operand to said target register;

whereby single instruction, multiple data processing can be performed.

Second Example Instruction Format for Communications
FIG 5B

FIG 5A

FIG 5C

Single PE Configuration
(I, O represents the Input, Output, or bi-directional ports.)

Nearest-Neighbor Communication Examples in a Single-PE Node
FIG 6A

Nearest-Neighbor Communication Examples in a Dual-PE Node

4th Execution of Multiply Add and DSU Send Surrogate
FIG 18

Execution of Butterfly with Clustered PE Destination Surrogate
FIG 19